R code for this article

(Previously: Part 1 of this series)

This post was inspired by the chapter ‘Penguins, Pessimists, and Paradoxes’ in the book Probably Overthinking It by Allen Downey.

How should we work out which of two surgical procedures are more effective for removing kidney stones? One approach might be to an collect data on patients who have undergone the first procedure, and patients who have had the second type of surgery, and compare the proportion of each group who still had kidney stones three months after their treatment. This is the approach that a group of urologists took in a 1986 paper published in The BMJ comparing the success rate of open surgery and a less invasive procedure called percutaneous nephrolithotomy in two groups of 350 patients1 (they actually also looked at another treatment that uses powerful sound waves to break up kidney stones, but in this article I will focus on the two surgical procedures). When collecting their data, the researchers also recorded which of the patients had kidney stones with a diameter smaller than 2cm before surgery and which had larger stones. I have summarised their findings in a data frame below.

# First we copy the data from the 1986 paper into R:
OS_surgeries <- 350 # the total number of open surgeries performed
OS_successes <- 273 # the total number of open surgeries which were successful
OS_large_surgeries <- 263 # the number of open surgeries performed on kidney stones smaller than 2cm in diameter
OS_large_successes <- 192 # the number of open surgeries on kidney stones smaller than 2cm in diameter which were successful
OS_small_surgeries <- OS_surgeries - OS_large_surgeries # the number of open surgeries performed on larger kidney stones 
OS_small_successes <- OS_successes - OS_large_successes # the number of open surgeries on larger kidney stones which were successful
PN_surgeries <- 350 # the total number of percutaneous nephrolithotomies performed
PN_successes <- 289 # the number of percutaneous nephrolithotomies which were successful
PN_large_surgeries <- 80 # the number of percutaneous nephrolithotomies performed on kidney stones smaller than 2cm in diameter
PN_large_successes <- 55 # the number of percutaneous nephrolithotomies on kidney stones smaller than 2cm in diameter which were successful
PN_small_surgeries <- PN_surgeries - PN_large_surgeries # the number of percutaneous nephrolithotomies performed on larger kidney stones 
PN_small_successes <- PN_successes - PN_large_successes # the number of percutaneous nephrolithotomies on larger kidney stones which were successful
# Now we can calculate the success rates and collect the results in a data frame:
kidney_df <- data.frame(
  Stone.Size = c("Small", "Large", "Both"),
  OS.Success.Rate = c(OS_small_successes/OS_small_surgeries, OS_large_successes/OS_large_surgeries, OS_successes/OS_surgeries),
  PN.Success.Rate = c(PN_small_successes/PN_small_surgeries, PN_large_successes/PN_large_surgeries, PN_successes/PN_surgeries)
)
print(kidney_df)
##   Stone.Size OS.Success.Rate PN.Success.Rate
## 1      Small       0.9310345       0.8666667
## 2      Large       0.7300380       0.6875000
## 3       Both       0.7800000       0.8257143

What conclusion should we draw here? When we consider the success rate of the surgeries on both types of stone, percutaneous nephrolithotomy comes out on top. But in both small and large stones, open surgery was more successful. What is going on?!

This study is a famous example of Simpson’s paradox, a phenomenon in which an effect that occurs in groups of data disappears or even reverses when the data are aggregated. Simpson’s paradox can happen when we investigate the relationship between an exposure and an outcome – in this case, the type of surgery performed and the result of the operation – without accounting for a third, ‘confounding’, variable that affects both the exposure and the outcome.

In the raw study data, we can see that patients with small kidney stones received a percutaneous nephrolithotomy much more commonly than open surgery. And from our calculated success rates we know that operations performed on small stones were more often successful than those carried out on large stones, regardless of the procedure used. So, in this study, stone size was a confounding variable. Had the researchers ignored it, they might have reached the spurious conclusion that their data showed percutaneous nephrolithotomy to be more effective at removing kidney stones than open surgery. We can represent this confounding effect using a probabilistic graphical model to show which variables influence each other.

To demonstrate this effect, we can run an R simulation and make Simpson’s paradox show up for ourselves. Let’s simulate one million patients with kidney stones, and assume for the sake of our experiment that they are equally likely to have small stones as large stones. In our example we’ll pretend that open surgery and percutaneous nephrolithotomy have exactly the same chance of success; 90%, on small stones, and 70% on large stones. But crucially, we’ll make doctors three times as likely to perform a percutaneous nephrolithotomy as an open surgery on small stones, and vice versa on large stones. This could be for any number of reasons – for example, perhaps patients with smaller stones are more inclined to be concerned about the side effects of open surgery, or perhaps there is a longer waiting time for percutaneous nephrolithotomy that patients with larger stones tend to be less willing to tolerate. (These are just examples of possible explanations, not necessarily the true reasons that the size of kidney stones affected doctors and patients’ choice of treatment in the 1986 study.)

# First we initialise a matrix to store the results we get for each patient's simulation:
simulated_patient_data <- matrix(rep(FALSE,3000000), nrow = 1000000, byrow = TRUE)
colnames(simulated_patient_data) <- c("Large Stone?", "Open Surgery?", "Treatment Success?")
for (i in 1:1000000) {
  # To give a patient a large kidney stone with a probability of 50%
  # we can randomly generate a real number between 0 and 1 
  # and give them an accurate result if it is less than or equal to 0.5:
  decide_stone_size <- runif(1, min=0, max=1)
  if (decide_stone_size <= 0.5) {
    simulated_patient_data[i,1] <- TRUE
    decide_treatment <- runif(1, min=0, max=1)
    # To make a patient with a large kidney stone three times more likely to receive an open surgery than a percutaneous nephrolithotomy
    # we can randomly generate a real number between 0 and 1 
    # and give them an open surgery if it is less than or equal to 0.75,
    # and a percutaneous nephrolithotomy if it is greater than 0.75:
    if (decide_treatment <= 0.75) {
      simulated_patient_data[i,2] <- TRUE
    } 
    # To give a patient with a large kidney stone a 70% chance of having a successful surgery
    # we can randomly generate a real number between 0 and 1 
    # and give them successful result if it is less than or equal to 0.7,
    # and an unsuccessful result if it is greater than 0.7:
    decide_success <- runif(1, min=0, max=1)
    if (decide_success <= 0.7) {
      simulated_patient_data[i,3] <- TRUE
    } 
  } else {
    # To make a patient with a small kidney stone three times more likely to receive a percutaneous nephrolithotomy than an open suregery
    # we can randomly generate a real number between 0 and 1 
    # and give them an open surgery if it is less than or equal to 0.25,
    # and a percutaneous nephrolithotomy if it is greater than 0.25:
    decide_treatment <- runif(1, min=0, max=1)
    if (decide_treatment <= 0.25) {
      simulated_patient_data[i,2] <- TRUE
    } 
    # To give a patient with a large kidney stone a 90% chance of having a successful surgery
    # we can randomly generate a real number between 0 and 1 
    # and give them successful result if it is less than or equal to 0.9,
    # and an unsuccessful result if it is greater than 0.9:
    decide_success <- runif(1, min=0, max=1)
    if (decide_success <= 0.9) {
      simulated_patient_data[i,3] <- TRUE
    } 
  }
}

Let’s take a look at the results for the first 20 of our simulated patients.

head(simulated_patient_data, n=20)
##       Large Stone? Open Surgery? Treatment Success?
##  [1,]        FALSE         FALSE               TRUE
##  [2,]         TRUE          TRUE               TRUE
##  [3,]         TRUE          TRUE              FALSE
##  [4,]        FALSE          TRUE               TRUE
##  [5,]        FALSE          TRUE               TRUE
##  [6,]         TRUE          TRUE              FALSE
##  [7,]        FALSE         FALSE               TRUE
##  [8,]        FALSE         FALSE               TRUE
##  [9,]        FALSE          TRUE              FALSE
## [10,]         TRUE          TRUE              FALSE
## [11,]         TRUE          TRUE               TRUE
## [12,]         TRUE         FALSE               TRUE
## [13,]         TRUE          TRUE               TRUE
## [14,]        FALSE         FALSE               TRUE
## [15,]        FALSE         FALSE               TRUE
## [16,]         TRUE          TRUE               TRUE
## [17,]         TRUE          TRUE               TRUE
## [18,]         TRUE          TRUE              FALSE
## [19,]         TRUE          TRUE              FALSE
## [20,]        FALSE          TRUE               TRUE

Now let’s count the numbers of procedures performed and successful outcomes for each type of surgery and stone size in our simulation.

sim_surgeries <- 1000000
sim_successes <- sum(simulated_patient_data[,3])
sim_large_surgeries <- sum(simulated_patient_data[,1])
sim_large_successes <- sum(simulated_patient_data[,1] & simulated_patient_data[,3])
sim_small_surgeries <- sim_surgeries - sim_large_surgeries
sim_small_successes <- sim_successes - sim_large_successes
sim_OS_surgeries <- sum(simulated_patient_data[,2])
sim_OS_successes <- sum(simulated_patient_data[,2] & simulated_patient_data[,3])
sim_OS_large_surgeries <- sum(simulated_patient_data[,1] & simulated_patient_data[,2])
sim_OS_large_successes <- sum(simulated_patient_data[,1] & simulated_patient_data[,2] & simulated_patient_data[,3])
sim_OS_small_surgeries <- sim_OS_surgeries - sim_OS_large_surgeries 
sim_OS_small_successes <- sim_OS_successes - sim_OS_large_successes 
sim_PN_surgeries <- sim_surgeries - sim_OS_surgeries
sim_PN_successes <- sim_successes - sim_OS_successes
sim_PN_large_surgeries <- sim_large_surgeries - sim_OS_large_surgeries
sim_PN_large_successes <- sim_large_successes - sim_OS_large_successes
sim_PN_small_surgeries <- sim_small_surgeries - sim_OS_small_surgeries
sim_PN_small_successes <- sim_small_successes - sim_OS_small_successes
cat(paste0("The total number of surgeries performed on large kidney stones was ", sim_large_surgeries, ", of which ", sim_large_successes, " were successful.\n", 
           "The total number of surgeries performed on small kidney stones was ", sim_small_surgeries, ", of which ", sim_small_successes, " were successful.\n\n",
           "The total number of open surgeries performed was ", sim_OS_surgeries, ", of which ", sim_OS_successes, " were successful.\n",
           "Of these open surgeries, ", sim_OS_large_surgeries, " were on large kidney stones, with ", sim_OS_large_successes, " successful outcomes,\n",
           "and ", sim_OS_small_surgeries, " were on small kidney stones, with ", sim_OS_small_successes, " successful outcomes.\n\n",
           "The total number of percutaneous nephrolithotomies performed was ", sim_PN_surgeries, ", of which ", sim_PN_successes, " were successful.\n",
           "Of these percutaneous nephrolithotomies, ", sim_PN_large_surgeries, " were on large kidney stones, with ", sim_PN_large_successes, " successful outcomes,\n",
           "and ", sim_PN_small_surgeries, " were on small kidney stones, with ", sim_PN_small_successes, " successful outcomes."))
## The total number of surgeries performed on large kidney stones was 499892, of which 349973 were successful.
## The total number of surgeries performed on small kidney stones was 500108, of which 449854 were successful.
## 
## The total number of open surgeries performed was 500113, of which 375070 were successful.
## Of these open surgeries, 374766 were on large kidney stones, with 262390 successful outcomes,
## and 125347 were on small kidney stones, with 112680 successful outcomes.
## 
## The total number of percutaneous nephrolithotomies performed was 499887, of which 424757 were successful.
## Of these percutaneous nephrolithotomies, 125126 were on large kidney stones, with 87583 successful outcomes,
## and 374761 were on small kidney stones, with 337174 successful outcomes.

We can now calculate the success rates for both types of surgery according to stone size and store the results in a data frame like the kidney_df one I used to store the data from the 1986 paper.

sim_kidney_df <- data.frame(
  Stone.Size = c("Small", "Large", "Both"),
  OS.Success.Rate = c(sim_OS_small_successes/sim_OS_small_surgeries, sim_OS_large_successes/sim_OS_large_surgeries, sim_OS_successes/sim_OS_surgeries),
  PN.Success.Rate = c(sim_PN_small_successes/sim_PN_small_surgeries, sim_PN_large_successes/sim_PN_large_surgeries, sim_PN_successes/sim_PN_surgeries)
)
print(sim_kidney_df)
##   Stone.Size OS.Success.Rate PN.Success.Rate
## 1      Small       0.8989445       0.8997041
## 2      Large       0.7001436       0.6999584
## 3       Both       0.7499705       0.8497060

In this computer experiment, we know that the probability of success for both procedures was the same when applied to the same category of stone, because of how we set up our simulation. This is borne out in the success rates for large and small kidney stones – they are virtually identical for open surgery and percutaneous nephrolithotomy. However, in the overall success rate we can see Simpson’s paradox rearing it’s head; if we were to ignore stone size, it would appear that open surgery was much less effective than percutaneous nephrolithotomy.

Simpson’s paradox has been observed in many disparate areas of medicine, from endocrinology to psychiatry2, illustrating the widespread danger of failing to control for confounding variables. The risk of confounding is best addressed by carefully designed randomised controlled trials (RCTs), which is one of the reasons that they are highly ranked in hierarchies of evidence. However, it may not always be ethical or feasible to study the effect of an intervention using an RCT, in which case researchers use statistical techniques such as stratification and propensity score matching to try to reduce the effects of confounding variables3.

Home Page


  1. C R Charig et al., “Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy,” British Medical Journal (Clinical Research Edition) 292, no. 6524 (March 29, 1986): 879–882.↩︎

  2. Steven Julious and Mark Mullee, “Confounding and Simpson’s paradox,” BMJ 309, no. 6967 (December 3 1994): 1480–1481.↩︎

  3. Jeffrey Aronson, Clare Bankhead, and David Nunan, “Confounding,” Catalogue of Bias, 2018.↩︎